Comparison of SVM & MLP for Numerai stock prediction

INM427 Neural Computing | RAJANI MOHAN JANIPALLI | City University of London

About data set

Data set source: https://www.kaggle.com/datasets/numerai/encrypted-stock-market-data-from-numerai

The data set is high-quality financial market data on global equities collected by Numerai. It has been cleaned, regularized and obfuscated to secure the value of the data set while retaining the unique features needed for building predictive models with machine learning. Numerai gives the data away so that users around the world have free, hedge-fund-quality data for their machine learning models, from which a quant hedge fund is built. Each instance corresponds to a stock at a particular time period, and the features describe various quantitative attributes of the stock at that time. The aim is to build a model that predicts the future target from the features that correspond to the current market.

Reference:

https://docs.numer.ai/tournament/learn

https://numerai.fund/ (see Answers at the bottom of the page).

Import the initially required libraries.

Coding Reference for all pandas related commands:

https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html#user-guide

Load the data into a pandas data frame.

See the number of rows and columns of the data frame.

Have a glance at the first five rows of the data frame.

Data frame information - check the names, data types and the number of non-missing values of all the columns in the data frame.
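The loading and inspection steps above can be sketched as follows; the inline sample is hypothetical stand-in data for the downloaded Kaggle CSV (whose actual file name and columns differ):

```python
import io
import pandas as pd

# In the notebook this would read the downloaded Kaggle file, e.g.
# pd.read_csv("numerai_training_data.csv"); the inline rows below are a
# hypothetical stand-in with the same kind of layout.
sample = io.StringIO(
    "feature1,feature2,feature3,target\n"
    "0.43,0.51,0.48,1\n"
    "0.55,0.47,0.52,0\n"
    "0.49,0.50,0.46,1\n"
)
df = pd.read_csv(sample)

print(df.shape)    # number of (rows, columns)
print(df.head())   # first five rows
df.info()          # names, dtypes, non-null counts per column
```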

Apply the seaborn plot style to all further plots.

Exploratory data analysis

As mentioned in the "About data set" section above, the data is already cleaned and regularized. Still, it is good practice to check, through exploratory data analysis, whether any further cleaning or regularization is needed.

Plot boxplots of all the columns of the data frame to check outliers and quartiles.

Plot all the features in sets of five to check their distributions.

Plot a histogram of the target column to visually check for class imbalance.

Check the counts of both classes of the target column to see their proportions.

Plot the correlation matrix of all the columns.
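The exploratory plots above can be sketched roughly as below. The tiny DataFrame is hypothetical stand-in data, and the heatmap uses plain matplotlib (seaborn's sns.heatmap would give an equivalent plot):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in rows; in the notebook, df is the loaded Numerai frame.
df = pd.DataFrame({
    "feature1": [0.43, 0.55, 0.49, 0.61, 0.38],
    "feature2": [0.51, 0.47, 0.50, 0.44, 0.58],
    "target":   [1, 0, 1, 0, 1],
})

# Boxplots of every column, to eyeball outliers and quartiles.
df.plot(kind="box", figsize=(10, 4))

# Histograms of the features, to check their distributions.
df.drop(columns="target").hist(bins=10)

# Class counts of the target, to check for imbalance.
print(df["target"].value_counts())

# Correlation matrix rendered as a heatmap.
corr = df.corr()
fig, ax = plt.subplots()
ax.matshow(corr)
```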

From the correlation matrix, there doesn't seem to be a need to remove any feature before fitting the models.

Check the descriptive statistics of all the columns.

Create separate sets of features and target.

Import the library for splitting data into training and test sets.

Create a test set from 20% of the original data set.

NOTE: Since the target values are almost balanced, stratification of split based on target column was not done.
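A minimal sketch of the split, using synthetic stand-in data in place of the Numerai features and target:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the notebook's features and target arrays.
features, target = make_classification(n_samples=100, n_features=21,
                                       random_state=42)

# 20% of the rows go to the test set; no stratify= argument is passed,
# since the classes are close to balanced.
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

print(x_train.shape, x_test.shape)  # (80, 21) (20, 21)
```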

Coding Reference:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Check the number of rows and columns of features and target of both training and test sets of data.

Import library to save data and models.

Save a test set of features into a pickle file.

Coding reference for pickle file creation:

https://www.youtube.com/watch?v=KfnhNlD8WZI

Save a test set of target into a pickle file.
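The pickle round trip can be sketched as below; the file path and the object being saved are stand-ins for the notebook's test-set variables:

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this is the test-set features/target.
x_test = [[0.43, 0.51], [0.55, 0.47]]

# Write the object to a pickle file...
path = os.path.join(tempfile.gettempdir(), "x_test.pkl")
with open(path, "wb") as f:
    pickle.dump(x_test, f)

# ...and read it back, e.g. in a different notebook.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == x_test)  # True
```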

Preparation of the SVM model

Import library for SVM classifier module.

Create a baseline model as a start, without assigning any arguments to the SVM classifier object.

Import library for performing cross validation.

Perform a 5 fold cross validation of the baseline model with complete training data set.

Take the average of cross validation scores for all the 5 folds.

Check the parameters of the baseline model.
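A minimal sketch of the baseline-model steps, on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's training set.
x_train, y_train = make_classification(n_samples=200, n_features=21,
                                       random_state=42)

# Baseline classifier: no arguments, so all defaults (rbf kernel, C=1.0).
baseline = SVC()

# 5 fold cross validation and the mean of the fold scores.
scores = cross_val_score(baseline, x_train, y_train, cv=5)
print(scores.mean())

# The parameters of the baseline model.
print(baseline.get_params())
```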

Hyperparameter tuning is the way to improve the performance of the model, and one standard approach is to perform a series of grid searches over different hyperparameters. However, performing grid search with all 21 features is computationally expensive. So, principal component analysis is used to create components that capture the variance among the features and represent it in fewer dimensions. Since variance is a key factor in the training of models, it is computationally efficient to perform the grid searches on the lower-dimensional principal components and find the best model for them. Thereafter, the parameters of that model can be used to create the improved best model for the original 21 features.

Import library to perform principal component analysis.

Create components that capture 95% of the variance.

It can be seen that 13 components capture 95% of the variance.

Check the variance captured by each of the 13 principal components.

It can be observed that the first six principal components, half of the 13 that together capture 95% of the variance, capture more than 70% of the variance on their own.

So, it is computationally even more efficient to create a set of six principal components and then perform the grid searches on them.

Cross-check that the six principal components capture more than 70% of the variance.
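The PCA steps above can be sketched as follows, on synthetic stand-in data (the exact component counts and variance ratios depend on the real data set):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in for the training set.
x_train, _ = make_classification(n_samples=200, n_features=21,
                                 n_informative=10, random_state=42)

# Passing a float keeps as many components as are needed to reach that
# share of the variance.
pca_95 = PCA(n_components=0.95).fit(x_train)
print(pca_95.n_components_)              # components needed for 95%
print(pca_95.explained_variance_ratio_)  # variance captured by each

# Passing an int keeps exactly that many components.
pca_6 = PCA(n_components=6).fit(x_train)
print(pca_6.explained_variance_ratio_.sum())  # total variance captured
x_train_pca = pca_6.transform(x_train)
print(x_train_pca.shape)  # (200, 6)
```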

PCA coding references:

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html ,

https://www.youtube.com/watch?v=8klqIM9UvAc

Create a base model for the principal components.

Create a dictionary of kernels to find the kernel with the best model score through grid search.

Import library to perform grid search.

NOTE: To get cross validated scores for grid search, GridSearchCV is used.

Pass the base model and the dictionary of kernels as parameters to the grid search object, to search for the best kernel for the model.

Fit the grid search object to the set of six principle components of the training data.

Create a pandas data frame of the grid search result, for easy viewing.

Print the data frame of grid search.

Check the best score out of scores for different kernels of the grid search.

Check the best performing kernel of the grid search.
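A sketch of the kernel grid search, on synthetic stand-in data; the candidate kernel list is an assumption, as the notebook's exact dictionary is not shown here:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the six principal components of the training data.
x_pca, y_train = make_classification(n_samples=100, n_features=6,
                                     random_state=42)

# Assumed candidate kernels.
param_grid = {"kernel": ["linear", "poly", "rbf", "sigmoid"]}

# GridSearchCV cross-validates every candidate (5 folds by default).
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(x_pca, y_train)

# cv_results_ as a DataFrame, for easy viewing.
results = pd.DataFrame(search.cv_results_)
print(results[["param_kernel", "mean_test_score"]])

print(search.best_score_)  # best mean CV score
best_kernel = search.best_params_["kernel"]  # kept in a variable, since it
print(best_kernel)                           # can vary between runs
```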

Grid search coding references:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV ,

https://www.youtube.com/watch?v=HdlDYng8g9s

Over multiple runs of the script, it was observed that the best performing kernel of the grid search sometimes changes. Therefore, a variable is assigned to hold whatever the best performing kernel turns out to be.

Create another object of classifier passing the best kernel from grid search as an argument.

Create a dictionary of gamma values to find the gamma with best model score through grid search.

Over multiple runs of the script, it was observed that the best gamma value is consistently 0.001. So, instead of assigning a variable to it, the value is passed directly as an argument to the next SVM object.

Create a dictionary of regularization parameter C to find the C with best model score through grid search.

Over multiple runs of the script, it was observed that the best performing C of the grid search sometimes changes. Therefore, a variable is assigned to hold whatever the best performing C turns out to be.
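The gamma and C searches might look roughly like this; the candidate value lists and the rbf kernel are assumptions for illustration (the notebook uses whichever kernel its earlier search selected):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the six principal components of the training data.
x_pca, y_train = make_classification(n_samples=100, n_features=6,
                                     random_state=42)

# Assumed candidate gammas; gamma was consistently best at 0.001, so it
# can be hard-coded afterwards.
gamma_search = GridSearchCV(SVC(kernel="rbf"),
                            {"gamma": [0.001, 0.01, 0.1, 1]}, cv=5)
gamma_search.fit(x_pca, y_train)

# C varies between runs, so the winner is kept in a variable.
c_search = GridSearchCV(SVC(kernel="rbf", gamma=0.001),
                        {"C": [0.1, 1, 10, 100]}, cv=5)
c_search.fit(x_pca, y_train)
best_C = c_search.best_params_["C"]
print(best_C)
```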

After tuning the hyperparameters, an SVM classifier object with all the best parameters obtained from grid search is created.

To check the generalized accuracy of this model over the principal components, a 5 fold cross validation was performed.

Then, another SVM classifier object is created with all the best parameters obtained from grid search, but this time to fit it to the original training data of 21 features.

To get an estimate of the generalized model score, a five fold cross validation was performed.

To see how much improvement in model accuracy the hyperparameter tuning brought, the mean cross validation scores of the baseline model and the best model from grid search are compared.

There is a slight improvement, if only by a very small amount, in the mean cross validation score over that of the baseline model. So, this can be considered the best SVM model for the training data.

The best SVM classifier object is now created with all the best parameters from above, this time also to check its performance over the test data set.

Import library required to calculate time taken for training a model.

Train the best SVM model over the training data consisting of all 21 features and also calculate the time taken for the training.

It took 263.52 seconds for the best SVM model to get trained.
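The timed training step can be sketched as below; the parameter values shown are hypothetical placeholders for the grid search winners, and the data is a synthetic stand-in:

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the full 21-feature training set.
x_train, y_train = make_classification(n_samples=200, n_features=21,
                                       random_state=42)

# Hypothetical best parameters; the notebook uses the winners of its own
# grid searches.
best_svm = SVC(kernel="rbf", gamma=0.001, C=10)

start = time.perf_counter()
best_svm.fit(x_train, y_train)
elapsed = time.perf_counter() - start
print(f"Training took {elapsed:.2f} seconds")
```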

Check the parameters of the best SVM model.

Save the best SVM model into a pickle file to export it and test it directly over the test set in a different notebook.

Load the best SVM model from the pickle file created above; it will be tested hereafter.

Load the features of the test data from the pickle file created above.

Load the target of the test data from the pickle file created above.

Predict targets from features of test data.

Evaluate training score of the best SVM model.

Evaluate test score of the best SVM model.

Coding Reference for all SVM commands:

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

Import library required for creating ROC curve and calculating AUC.

Obtain false positive rate, true positive rate and thresholds for best SVM model.

Calculate AUC for best SVM model over test data.

Plot ROC curve for best SVM model over test data.
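A sketch of the ROC/AUC computation on synthetic stand-in data; SVC's decision_function supplies the continuous scores that roc_curve needs:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data split into training and test sets.
x, y = make_classification(n_samples=200, n_features=21, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
                                                    random_state=42)

model = SVC().fit(x_train, y_train)

# ROC needs a continuous score; decision_function provides one for SVC.
decision_scores = model.decision_function(x_test)
fpr, tpr, thresholds = roc_curve(y_test, decision_scores)
roc_auc = auc(fpr, tpr)
print(roc_auc)

plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
```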

Coding Reference for ROC and AUC:

https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#

Import library required for plotting confusion matrix.

Plot confusion matrix for best SVM model over test data.

Coding Reference for Confusion matrix:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_confusion_matrix.html

Import library for obtaining classification report.

Produce Classification report of best SVM model over test data.

Coding Reference for Classification Report:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

Preparation of the MLP model

Import library for MLP classifier.

Create a baseline MLP model without passing any arguments to the classifier object.

Perform a 5 fold cross validation of the baseline model with complete training data set.

Take the average of cross validation scores for all the 5 folds.

Check the parameters of the baseline model.

While performing the 5 fold cross validation above, a series of warnings was observed that the optimization had not converged even after reaching the maximum number of iterations of 200. It is a good idea to see how the baseline model's performance changes for a higher maximum number of iterations. So, another version of the baseline model with a maximum of 500 (arbitrarily chosen) iterations is created and a 5 fold cross validation is performed.

It can be seen that with the increase in the maximum number of iterations from 200 to 500, although the optimization converged for all 5 folds of cross validation, the mean cross validation score decreased by a few decimal points. So, the original baseline model seems better than the one with the increased maximum number of iterations.

As a starting point for tuning the number of hidden layers and the number of neurons in each layer, we start with a single hidden layer, based on the universal approximation theorem. Per common rules of thumb, the candidate numbers of neurons are chosen as follows:

1) 10 - which is between the size of input and output layers.

2) 16 - which is 2/3 the size of input layer + size of output layer.

3) 30 - which is less than twice the size of the input layer.

These concepts are adapted from: https://www.heatonresearch.com/2017/06/01/hidden-layers.html .

In addition, we also use 50 and 80 to see how the model scores change as they approach 100, the number of neurons in the original baseline model.
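The single-hidden-layer search over these candidate widths can be sketched as below, on synthetic stand-in data (max_iter is reduced here only to keep the sketch quick):

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

warnings.simplefilter("ignore", ConvergenceWarning)  # keep the output readable

# Synthetic stand-in for the training set.
x_train, y_train = make_classification(n_samples=100, n_features=21,
                                       random_state=42)

# Single hidden layer; candidate widths from the rules of thumb above.
param_grid = {"hidden_layer_sizes": [(10,), (16,), (30,), (50,), (80,)]}

# max_iter is kept small here only so the sketch runs quickly.
search = GridSearchCV(MLPClassifier(max_iter=100, random_state=42),
                      param_grid, cv=5)
search.fit(x_train, y_train)

best_width = search.best_params_["hidden_layer_sizes"]
print(best_width, search.best_score_)
```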

The best 3 configurations from previous grid search are taken and another layer with 2 neurons is added.

It can be seen that the increase in the number of hidden layers has decreased the model score a bit. To check whether this was an effect of the total number of neurons across all hidden layers, the number of neurons in the first hidden layer is reduced in the next grid search.

It is now clear that, despite having the same total number of neurons across the hidden layers as in the single-hidden-layer configuration, the model scores have come down further. So, it seems a single hidden layer gives better model scores than multiple hidden layers. The best score for the single-hidden-layer configuration was given by 16 neurons. To cover for the uncertainty that a different number of neurons may give the best score in a different run of that grid search, the number of neurons with the best score is assigned to a variable, which is then passed as an argument to the MLP model object.

Create a dictionary of activation functions to find the activation with best model score through grid search.

It can be seen that the model score has gone up a bit with the change of activation function. However, the best activation function changes on different runs of the grid search. So, the best activation function resulting from the grid search is assigned to a variable, which is then passed as an argument to the classifier object.

Create a dictionary of weight optimization solver to find the solver with best model score through grid search.

After multiple runs of the above grid search, adam was consistently the best solver for weight optimization. In addition, the adam and sgd solvers allow tuning a few other hyperparameters, which cannot be done with lbfgs. Also, lbfgs is more suitable for smaller data sets. So, in this case, the solver adam is passed directly as an argument to the classifier object, rather than being assigned to a variable.

Create a dictionary of L2 regularization parameter alpha to find the alpha with best model score through grid search.

Over multiple runs, the model scores for all three values of alpha, which controls L2 regularization, have been close, but the best performing alpha was not consistent. So, the best performing alpha value is assigned to a variable rather than being passed directly to the classifier object.

Create a dictionary of batch size of mini batches for optimization to find the batch size with best model score through grid search.

It can be observed that the model scores are the same for batch_size values of 200 and 'auto'. This is because 'auto' resolves to min(200, n_samples), i.e. a batch size of 200 whenever the training set has at least 200 rows, as per the sklearn MLPClassifier documentation ( https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html ). So, it is better to pass 'auto', which is the default value, as an argument to the model object.

Create a dictionary of learning rate for weight updates to find the learning rate with best model score through grid search.

It can be seen that the model scores for all three learning rate schedules for weight updates are the same, and this behaviour has been observed over multiple runs. So, the best performing one is assigned to a variable, which is then passed as an argument to the classifier object.

Create a dictionary of initial learning rates to find the initial learning rate with best model score through grid search.

Clearly, the default value of 0.001 gives the best model score. So, it is better to keep the initial learning rate at its default value.

Create a dictionary of maximum iterations to find the maximum iterations with best model score through grid search.

It can be observed that the scores for all iteration values are the same, and this pattern was observed over multiple runs. Since the same score is achieved with 200 iterations, the default, it is better to keep the maximum number of iterations at its default value, keeping the computational cost low.

Create a dictionary of momentum values for gradient descent updates to find the momentum with best model score through grid search.

It is clear that the model scores for all values of momentum for the gradient descent update are the same. So, here too, it is better to keep it at the default value of 0.9.

Create a dictionary of early stopping conditions to find the condition with best model score through grid search.

As indicated by the results of the above grid search, the difference in model score with and without early stopping is very small. The default value is 'False'. But, as per the sklearn MLPClassifier documentation ( https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html ), setting the value to 'True' automatically sets aside 10% of the training data as a validation set, and training is stopped when there is no further iterative improvement in the validation score. So, from a model generalization point of view, it is better to pass 'True' as an argument to the classifier object.

Create an MLP classifier model with all the best parameters from the above series of grid searches.

To estimate a generalized score of this model, perform a 5 fold cross validation over the training data.
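A sketch of assembling and cross-validating the final model; the parameter values shown are hypothetical placeholders for the winners of the grid searches above, and the data is a synthetic stand-in:

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

warnings.simplefilter("ignore", ConvergenceWarning)

# Synthetic stand-in for the training set.
x_train, y_train = make_classification(n_samples=100, n_features=21,
                                       random_state=42)

# Hypothetical best parameters; substitute the actual grid search winners.
best_mlp = MLPClassifier(hidden_layer_sizes=(16,), activation="tanh",
                         solver="adam", alpha=0.0001, batch_size="auto",
                         learning_rate="constant", early_stopping=True,
                         random_state=42)

# 5 fold cross validation for a generalized score estimate.
scores = cross_val_score(best_mlp, x_train, y_train, cv=5)
print(scores.mean())
```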

To see how much improvement in model accuracy the hyperparameter tuning brought, the mean cross validation scores of the baseline model and the best model from grid search are compared.

Even though this is a small improvement, it is higher than that seen for the SVM. So, this improved model can be considered the best MLP model.

The best MLP classifier object is now created with all the best parameters from above, this time also to check its performance over the test data set.

Train the best MLP model over the training data and calculate the time for training as well.

The training time of the best MLP model is, surprisingly, lower than that of the best SVM model.

Check the parameters of the best MLP model.

Save the best MLP model into a pickle file to export it and predict on the test data with it in another notebook.

Load the best MLP model from the pickle file created above; it will be tested hereafter.

Predict target values for given test features through the best MLP model.

Evaluate training score of the best MLP model.

Evaluate test score of the best MLP model.

Coding Reference for all MLP commands:

https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

Plot loss curve for best MLP model.
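A sketch of the loss-curve plot on synthetic stand-in data; loss_curve_ is recorded per iteration by the sgd and adam solvers:

```python
import warnings
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

warnings.simplefilter("ignore", ConvergenceWarning)

# Synthetic stand-in for the training set.
x_train, y_train = make_classification(n_samples=100, n_features=21,
                                       random_state=42)

# The adam solver records the training loss at each iteration in loss_curve_.
mlp = MLPClassifier(hidden_layer_sizes=(16,), solver="adam",
                    random_state=42).fit(x_train, y_train)

plt.plot(mlp.loss_curve_)
plt.xlabel("Iteration")
plt.ylabel("Training loss")
plt.title("MLP loss curve")
```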

Coding Reference for loss curve:

https://michael-fuchs-python.netlify.app/2021/02/03/nn-multi-layer-perceptron-classifier-mlpclassifier/

Obtain the false positive rate, true positive rate and thresholds for the best MLP model.

Calculate the AUC for the best MLP model over the test data.

Plot ROC curve for best MLP model over test data.

Plot confusion matrix for best MLP model over test data.

Produce Classification report for best MLP model over test data.

*** END ***